from IPython.core.display import display, HTML, Image
display(Image('images/sentimentplot.png'))
Part 1: Analyzing Trends by Time
Part 2: Discovering Channel Keywords
Part 3: Channel Clustering
Part 4: Sentiment Analysis
How do you measure sentiment? This is a genuinely tricky question.
Sentiment analysis is not an exact science, but there are a number of metrics that attempt to quantify the 'sentiment' of words. For social media and microblogs, the dictionary-based sentiment module ANEW (Affective Norms for English Words) has had some success. The ANEW dictionary, constructed by Bradley and Lang at the University of Florida, contains 1,034 words rated for valence, arousal, and dominance.
The two main measures of ANEW we'll use are valence (how pleasant or unpleasant a word is) and arousal (how energetic or calming it is).
Let's look at a couple of demonstrations to get a better idea.
from anew_module import anew
A word like 'fun' has high arousal and high valence, compared to 'boring', which has low arousal and low valence.
print(anew.sentiment('fun'))
print(anew.sentiment('boring'))
'Love' and 'hate' score similarly on arousal because both are high-energy words. On valence, however, they sit at opposite ends of the scale.
print(anew.sentiment('love'))
print(anew.sentiment('hate'))
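Under the hood, a dictionary-based scorer like this boils down to averaging the ratings of the words it recognizes. Here's a minimal sketch of the idea with a hypothetical two-word lexicon (the numbers are made up for illustration; they are not the real ANEW ratings, and `toy_sentiment` is not the actual `anew_module` implementation):

```python
# Toy lexicon standing in for the ANEW dictionary. Ratings are invented
# for illustration only.
TOY_LEXICON = {
    'fun':    {'valence': 8.4, 'arousal': 7.2},
    'boring': {'valence': 2.7, 'arousal': 2.6},
}

def toy_sentiment(words):
    """Average the ratings of the words found in the lexicon."""
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:  # no rated words -> no score
        return {'valence': 0.0, 'arousal': 0.0}
    return {
        'valence': sum(h['valence'] for h in hits) / len(hits),
        'arousal': sum(h['arousal'] for h in hits) / len(hits),
    }

print(toy_sentiment(['fun']))            # {'valence': 8.4, 'arousal': 7.2}
print(toy_sentiment(['so', 'boring']))   # only 'boring' is rated
```

Words not in the lexicon are simply ignored, which is why posts with no rated words end up with a score of 0.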
import pandas as pd
import re
import string
from anew_module import anew
data = pd.read_csv('data/concatenated.csv', header = 0)
Before applying ANEW, we will convert all words to lower case and remove punctuation.
punc = re.compile( '[%s]' % re.escape( string.punctuation ) )
data['cleaned'] = data['text'].apply(lambda x: punc.sub(' ',str(x).lower()))
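To see what this cleaning step does, here is the same pattern applied to a single made-up post (the sample text is invented for illustration):

```python
import re
import string

# Same pattern as above: match any punctuation character
punc = re.compile('[%s]' % re.escape(string.punctuation))

sample = "Wow!! This channel is GREAT, isn't it?"
cleaned = punc.sub(' ', sample.lower())
print(cleaned)  # all lowercase, with every punctuation mark replaced by a space
```

Replacing punctuation with spaces (rather than deleting it) keeps adjacent words from being glued together before the text is split into words.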
data['arousal'] = data['cleaned'].apply(lambda x: anew.sentiment(str(x).split())['arousal'])
data['valence'] = data['cleaned'].apply(lambda x: anew.sentiment(str(x).split())['valence'])
Let's examine how the sentiments are distributed.
import matplotlib.pyplot as plt
%matplotlib inline
ax = data['arousal'].hist(bins = 10, range = (0,10));
ax.set_title('Distribution of Arousal Scores');
ax.set_xlabel('ANEW Arousal');
ax.set_ylabel('Post Counts');
There are quite a few posts with a score of 0 (likely posts containing no ANEW-rated words), with the rest mostly centered around 4-7. When a post does have a score, it tends toward the higher end of the scale rather than the lower end.
ax = data['valence'].hist(bins = 10, range = (0,10));
ax.set_title('Distribution of Valence Scores');
ax.set_xlabel('ANEW Valence');
ax.set_ylabel('Post Counts');
Again, a number of posts score 0, with the rest mostly between 4 and 7. There are more 9s here than in the arousal plot above.
Now comes the really exciting part: let's map out the sentiment of every single post! We'll plot the two scores against each other in a 2-dimensional plot, segmented by channel. I've created this with the Python library plotly and embedded it as HTML.
import plotly
import plotly.plotly as py
plotly.offline.init_notebook_mode() # run at the start of every notebook
from IPython.core.display import display, HTML, Image
display(HTML(filename='images/sentiment_visualizer.html'))  # pass filename= so the file is read, not rendered as a literal string
(You can click each channel in the legend to toggle it on or off and compare how channels differ in their sentiment.) If the HTML above doesn't load, you can see the plot at: https://plot.ly/~elimelim/2.embeda
To wrap up: sentiment analysis is not an exact science. Sentiment is complex, and it is highly contextual. A dictionary-based score cannot possibly capture sarcasm, or the different levels of valence a word carries in different contexts. However, these measures give an overarching baseline, and can be used to see how one text measures relative to another.
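As a concrete illustration of the sarcasm problem, a toy dictionary scorer (with made-up ratings, not the real ANEW values) rates a sarcastic post exactly as positively as a sincere one, because it only sees the individual words:

```python
# Invented valence ratings for illustration only.
lexicon = {'great': 8.2, 'terrible': 2.0}

def mean_valence(text):
    """Average the valence of the rated words in a post."""
    rated = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(rated) / len(rated) if rated else 0.0

# Both posts score identically: the scorer only sees the word 'great'
# and has no notion of the sarcastic framing around it.
print(mean_valence("this party is great"))
print(mean_valence("oh great yet another meeting"))
```

Both calls return the same high valence, which is exactly the kind of context blindness described above.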
And that wraps up the series on Slack analysis. I hope you enjoyed this microblog!